Assignment Instructions:

Assignment submission: Henry Oliver____


library(tidyverse)
library(here)
library(janitor)
library(estimatr)  
library(performance)
library(jtools)
library(gt)
library(gtsummary)
library(interactions) 

Introduction

You’re about to dive into some deep data collected from five reef sites in Santa Barbara County, all about the abundance of California spiny lobsters! 🦞 Data was gathered by divers annually from 2012 to 2018 across Naples, Mohawk, Isla Vista, Carpinteria, and Arroyo Quemado reefs.

Why lobsters? Well, this sample provides an opportunity to evaluate the impact of Marine Protected Areas (MPAs) established on January 1, 2012 (Reed, 2019). Of these five reefs, Naples, and Isla Vista are MPAs, while the other three are not protected (non-MPAs). Comparing lobster health between these protected and non-protected areas gives us the chance to study how commercial and recreational fishing might impact these ecosystems.

We will consider the MPA sites the treatment group and use regression methods to explore whether protecting these reefs really makes a difference compared to non-MPA sites (our control group). In this assignment, we’ll think deeply about which causal inference assumptions hold up under the research design and identify where they fall short.

Let’s break it down step by step and see what the data reveals! 📊


Step 1: Anticipating potential sources of selection bias

a. Do the control sites (Arroyo Quemado, Carpenteria, and Mohawk) provide a strong counterfactual for our treatment sites (Naples, Isla Vista)? Write a paragraph making a case for why this comparison is ceteris paribus or whether selection bias is likely (be specific!).


A perfect counterfactual for our treatment sites would be identical in every way to the treatment sites, except for the implementation of an MPA. In this experiment, there are many differences between the sites other than the presence of the MPA. Geographically, Naples and Isla Vista are right next to each other, while the control sites are further spread out along the coast. These locations have different habitat quality, wave patterns, water temperature, geology, and nutrient availability, all of which could have an affect on lobster populations. Selection bias is likely present, because these locations were probably selected for financial, political, or ease of access reasons, rather than being randomly selected.

Step 2: Read & wrangle data

a. Read in the raw data from the “data” folder named spiny_abundance_sb_18.csv. Name the data.frame rawdata

b. Use the function clean_names() from the janitor package

# HINT: check for coding of missing values (`na = "-99999"`)

# Read in data, define -99999 as na
rawdata <- read_csv(here("data/spiny_abundance_sb_18.csv"), na = "-99999") %>% 
    clean_names() # Make columns names lower_case

c. Create a new df named tidyata. Using the variable site (reef location) create a new variable reef as a factor and add the following labels in the order listed (i.e., re-order the levels):

"Arroyo Quemado", "Carpenteria", "Mohawk", "Isla Vista",  "Naples"
tidydata <- rawdata %>% 
    mutate(reef = factor(site, # Create new column called reef. Make it a factor based on site
                         levels =c("AQUE", "CARP", "MOHK", "IVEE", "NAPL"), #List values of site in order of new factor
                         # Create labels for new column in the same order as above
                         labels = c("Arroyo Quemado", "Carpenteria", "Mohawk", "Isla Vista",  "Naples")))

Create new df named spiny_counts

d. Create a new variable counts to allow for an analysis of lobster counts where the unit-level of observation is the total number of observed lobsters per site, year and transect.

  • Create a variable mean_size from the variable size_mm
  • NOTE: The variable counts should have values which are integers (whole numbers).
  • Make sure to account for missing cases (na)!

e. Create a new variable mpa with levels MPA and non_MPA. For our regression analysis create a numerical variable treat where MPA sites are coded 1 and non_MPA sites are coded 0

#HINT(d): Use `group_by()` & `summarize()` to provide the total number of lobsters observed at each site-year-transect row-observation. 

#HINT(e): Use `case_when()` to create the 3 new variable columns

spiny_counts <- tidydata %>% 
    group_by(site, year, transect) %>% # Group data by site, year and transect
    summarize(
        counts = sum(count), # Create new column for counts of lobsters from new groups
        mean_size = mean(size_mm, na.rm = TRUE),
                         .groups = "drop") %>% # Create a new column for mean size, ungroup
    mutate(
    mpa = case_when( # Create mpa column
      site %in% c("NAPL", "IVEE") ~ "MPA", # label 'MPA' when site is NAPL or IVEE
      site %in% c("AQUE", "CARP", "MOHK") ~ "non_MPA"# Label 'non_MPA' when site is AQUE, CARP, or MOHK
    ),
    mpa = factor(mpa, levels = c("non_MPA", "MPA")), # Set non_MPA as reference
    treat = case_when(
      site %in% c("NAPL", "IVEE") ~ 1,
      site %in% c("AQUE", "CARP", "MOHK") ~ 0
    )
  )

NOTE: This step is crucial to the analysis. Check with a friend or come to TA/instructor office hours to make sure the counts are coded correctly!


Step 3: Explore & visualize data

a. Take a look at the data! Get familiar with the data in each df format (tidydata, spiny_counts)

b. We will focus on the variables count, year, site, and treat(mpa) to model lobster abundance. Create the following 4 plots using a different method each time from the 6 options provided. Add a layer (geom) to each of the plots including informative descriptive statistics (you choose; e.g., mean, median, SD, quartiles, range). Make sure each plot dimension is clearly labeled (e.g., axes, groups).

Create plots displaying the distribution of lobster counts:

  1. grouped by reef site
  2. grouped by MPA status
  3. grouped by year

Create a plot of lobster size :

  1. You choose the grouping variable(s)!
# plot 1: ....
# Count distribution by site - histogram
plot1 <- spiny_counts %>% 
ggplot(aes(x = counts)) + # display counts
    geom_histogram() +
    geom_vline(aes(xintercept = mean(counts), linetype = "Mean"), color = "red", size = 1) +
    labs(title = "Distribution of Lobster Counts by Site",
         x = "Lobster Counts",
         y = "Frequency") +
    scale_linetype_manual(name = "", values = c("Mean" = "dashed")) + # Add mean line
    facet_wrap(~site)+ # One plot for each site
    theme_minimal()

# Count distribution by MPA status - density plot
plot2 <- spiny_counts %>% 
  group_by(mpa) %>%
  mutate(med = median(counts)) %>%
  ggplot(aes(x = counts, fill = mpa)) + # grouped by MPA status
  geom_density(alpha = 0.6) +
  geom_vline(aes(xintercept = med, color = mpa, linetype = "Median"), size = 1) +
  labs(title = "Density Distribution of Lobster Counts by MPA Status",
       x = "Lobster Counts",
       y = "Density") +
  scale_fill_manual(values = c("#F26666", "#66F2F2")) + # custom colors
  scale_color_manual(values = c("#FF2626", "#1FA5C4")) +
  scale_linetype_manual(name = "", values = c("Median" = "dashed"))

# Count distribution by year - jitter plot
plot3 <- spiny_counts %>% 
    #  We want to color by year since jitter plots can have some confusing overlap. In order to do this year must be a factor
  ggplot(aes(x = counts, y = year, color = as.factor(year))) + 
  geom_jitter(size = 2, alpha = 0.7) +
  geom_vline(aes(xintercept = mean(counts), linetype = "Mean"), size = 1) +
  labs(title = "Distribution of Lobster Counts Across Years",
       x = "Lobster Counts",
       y = "Year") +
  scale_color_viridis_d(guide = "none") +  # Different color for each year
  scale_y_continuous(breaks = unique(spiny_counts$year)) +  # Label every year
  scale_linetype_manual(name = "", values = c("Mean" = "dashed")) # Key for stat

# Size distribution by MPA status - Violin plot
plot4 <- spiny_counts %>% 
  ggplot(aes(x = mpa, y = mean_size, fill = mpa)) + 
  geom_violin(alpha = 0.6) +
  stat_summary(fun = mean, aes(shape = "Mean"), geom = "point", size = 3, color = "black") +
  labs(title = "Distribution of Mean Lobster Size by MPA Status",
       x = "MPA Status",
       y = "Mean Size (mm)") +
  scale_fill_manual(values = c("#F26666", "#66F2F2"), guide = "none") +
  scale_shape_manual(name = "", values = c("Mean" = 16))

plot1

plot2

plot3

plot4

c. Compare means of the outcome by treatment group. Using the tbl_summary() function from the package gt_summary

# USE: gt_summary::tbl_summary()

spiny_counts %>%
  select(mpa, counts, mean_size, year) %>%  # Select variables to summarize
  tbl_summary(by = mpa)  # Group by mpa status
Characteristic non_MPA
N = 133
1
MPA
N = 119
1
counts 9 (4, 22) 13 (3, 34)
mean_size 73 (68, 77) 77 (71, 80)
    Unknown 15 12
year

    2012 19 (14%) 17 (14%)
    2013 19 (14%) 17 (14%)
    2014 19 (14%) 17 (14%)
    2015 19 (14%) 17 (14%)
    2016 19 (14%) 17 (14%)
    2017 19 (14%) 17 (14%)
    2018 19 (14%) 17 (14%)
1 Median (Q1, Q3); n (%)

Step 4: OLS regression- building intuition

a. Start with a simple OLS estimator of lobster counts regressed on treatment. Use the function summ() from the jtools package to print the OLS output

b. Interpret the intercept & predictor coefficients in your own words. Use full sentences and write your interpretation of the regression results to be as clear as possible to a non-academic audience.

The intercept coefficient value indicates the number of lobsters that we can expect at a non-MPA site. Specifically, our model indicates that we should expect to count an average of about 22.73 lobsters per year at a non-MPA site. The intercept indicates the change in the number of lobsters that we expect to see when we apply our treatment, or in this case, when we search for lobsters in an MPA. In this case, we expect an MPA site to about five more lobsters than a non-MPA site.

# NOTE: We will not evaluate/interpret model fit in this assignment (e.g., R-square)

m1_ols <- lm(counts ~ mpa, data = spiny_counts)

summ(m1_ols, model.fit = FALSE) 
Observations 252
Dependent variable counts
Type OLS linear regression
Est. S.E. t val. p
(Intercept) 22.73 3.57 6.36 0.00
mpaMPA 5.36 5.20 1.03 0.30
Standard errors: OLS

d. Explain the results of the 4 diagnostic plots. Why are we getting this result?

check_model(m1_ols,  check = "qq" )

This first plot shows that our count distribution deviates from a normal distribution. This indicates that our counts are least consistent between sites near the lowest and highest bounds of our counts. We’re getting this result because our distribution is not necessarily normal - and some sites may have many more large or small lobsters than other sites.

check_model(m1_ols, check = "normality")

The second plot shows the density of our residuals compared to that of a normal distribution. Our results have the highest density in the low-residual range, but generally follow the normal curve. This means we have low error. We’re probably seeing this because most of our plots have a similar number of lobsters (3-10).

check_model(m1_ols, check = "homogeneity")

The third plot indicates that our errors are not consistent across our dataset. We’re seeing this likely because there is a lot of variation in counts in the middle of our count range (25-50).

check_model(m1_ols, check = "pp_check")

This plot indicates how well our model imitates the patterns found in our actual data. These results show that our model is not able to replicate the peak of our counts that are around 3-10. This is probably because we have so much data in this range, a simple lm can’t account for the dispersion.


Step 5: Fitting GLMs

a. Estimate a Poisson regression model using the glm() function

b. Interpret the predictor coefficient in your own words. Use full sentences and write your interpretation of the results to be as clear as possible to a non-academic audience.

In order to interpret these coefficients, we need to make some mathematical modifications. Because we used a logarithmic model, our outputs are still in logarithmic terms or “link space”. Basically, the results are in terms that the computer understands, but are difficult for humans to understand. In order to get them in to numeric terms (number of lobsters) we have to use the inverse function. In this case, we will raise e^value of our coefficients. When we do this, our intercept value becomes ~ 23, which means that in a non-MPA site we would expect to find 23 lobsters. Our treatment coefficient becomes ~24%, which means that in an MPA site we would expect to find 1.24 times more lobsters on average than a non-treatment site - or about five more lobsters (if we’re assuming our non-MPA has 3 lobsters).

c. Explain the statistical concept of dispersion and overdispersion in the context of this model.

Dispersion describe the distribution of our data compared to a “normal” distribution or a bell curve, which describes a distribution that has the most measurements near the mean, the least measurements at the minimum and maximum count. Over-dispersion refers to data that deviates significantly from a “normal” distribution. In our case, normally distributed data would mean that most of our sites have around 3 lobsters, very few sites have 0-1 lobsters, and very few sites have 25-50 lobsters.

Over-dispersed data would have much more spread than this—we’d see more sites with very low counts (1-3 lobsters) and more sites with very high counts (50-100 lobsters) than the Poisson model predicts. This extra variability is a problem because it can make our statistical tests unreliable. Overdispersion might happen if lobsters cluster together in certain spots due to rocky habitat or food availability, meaning some transects catch huge numbers while others catch almost none. Things like like storms, temperature changes, or local fishing could also cause lobster populations to vary much more dramatically from site to site than we’d expect from random chance.

d. Compare results with previous model, explain change in the significance of the treatment effect

The poisson model predicted a slightly lower effect in MPA sites. Our coefficient value dropped from about 5 to 1.28 between models. This is because the poisson model gives us a multiplicative (percentage increase) value for the treatment coefficient, which means if we had a site that had 3 lobsters, we would expect that if this site was an MPA it would instead have 1.28x the number of lobsters, or about 4 lobsters. The effect of the MPA is slightly lower for a population of 3 lobsters in our Poisson model than our OLS model - but very comparable.

#HINT1: Incidence Ratio Rate (IRR): Exponentiation of beta returns coefficient which is interpreted as the 'percent change' for a one unit increase in the predictor 

#HINT2: For the second glm() argument `family` use the following specification option `family = poisson(link = "log")`

m2_pois <- glm(counts ~ mpa, # mpa status is our predictor
                     family = poisson(link = "log"), # poisson regression, which uses log
                     data = spiny_counts)

# View the results
summary(m2_pois)
## 
## Call:
## glm(formula = counts ~ mpa, family = poisson(link = "log"), data = spiny_counts)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  3.12366    0.01819 171.744   <2e-16 ***
## mpaMPA       0.21184    0.02510   8.441   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 10438  on 251  degrees of freedom
## Residual deviance: 10366  on 250  degrees of freedom
## AIC: 11366
## 
## Number of Fisher Scoring iterations: 6

e. Check the model assumptions. Explain results. The PP check shows that our model is not very accurate at predicting count, especially in the 5-15 range of lobsters. The Misspecified dispersion and zero-inflation figure indicates that our data is potentially over-dispersed, and our model is not accounting for it. The Homogeneity of Variance figure shows that our variation is relatively homogenous, meaning that variation or grouping of lobsters is similar between treatment and non-treatment groups. The influential observations graph shows that our outlier observations of very low or very high counts of lobsters have significant influence on the accuracy of our model. The Distribution of Quantile residuals figure indicates that our residuals have high variance across our entire distribution. These results indicate that the model is best at predicting count in the 0.7-0.85 quantile range, and less effective to varying degrees at points in our distribution.

f. Conduct tests for over-dispersion & zero-inflation. Explain results.

Our results show that our dispersion ratio is 67. If our data was consistent with the Poisson regression, this value would be 1. This indicates that our data has 67 times the variation that the poisson model predicts - which means it is highly overdispersed. Our zero-inflation results show that we have 27 0-counts in our data, and poisson predicted 0. This means that our Poisson model is failing.

check_model(m2_pois)

check_overdispersion(m2_pois)
## # Overdispersion test
## 
##        dispersion ratio =    67.033
##   Pearson's Chi-Squared = 16758.289
##                 p-value =   < 0.001
check_zeroinflation(m2_pois)
## # Check for zero-inflation
## 
##    Observed zeros: 27
##   Predicted zeros: 0
##             Ratio: 0.00

g. Fit a negative binomial model using the function glm.nb() from the package MASS and check model diagnostics

h. In 1-2 sentences explain rationale for fitting this GLM model.

Negative binomial model incorporates a dispersion parameter, which makes it better for modeling over-dispersed data.

i. Interpret the treatment estimate result in your own words. Compare with results from the previous model.

The treatment estimate indicates a 24% increase in lobsters at MPA sites. It is slightly lower than the estimate of lobster increase at MPA sites produced by our poisson model. This predicts that just over three more lobsters in MPAs than non-MPAs with 23 lobsters. This is slightly more than our OLS model predicted, which was a 21% increase in lobsters - from 23 (nonMPA) to 28 (MPA)

library(MASS) ## NOTE: The `select()` function is masked. Use: `dplyr::select()` ##
# NOTE: The `glm.nb()` function does not require a `family` argument
m3_nb <- glm.nb(counts ~ mpa, data = spiny_counts)

# View results
summary(m3_nb)
## 
## Call:
## glm.nb(formula = counts ~ mpa, data = spiny_counts, init.theta = 0.5500333101, 
##     link = log)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   3.1237     0.1183  26.399   <2e-16 ***
## mpaMPA        0.2118     0.1720   1.232    0.218    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(0.55) family taken to be 1)
## 
##     Null deviance: 302.18  on 251  degrees of freedom
## Residual deviance: 300.66  on 250  degrees of freedom
## AIC: 2088.5
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  0.5500 
##           Std. Err.:  0.0466 
## 
##  2 x log-likelihood:  -2082.5280
check_overdispersion(m3_nb)
## # Overdispersion test
## 
##  dispersion ratio = 1.398
##           p-value = 0.088
check_zeroinflation(m3_nb)
## # Check for zero-inflation
## 
##    Observed zeros: 27
##   Predicted zeros: 30
##             Ratio: 1.12
check_predictions(m3_nb)

check_model(m3_nb)


Step 6: Compare models

a. Use the export_summ() function from the jtools package to look at the three regression models you fit side-by-side.

c. Write a short paragraph comparing the results. Is the treatment effect robust or stable across the model specifications.

The treatment effect is stable across Poisson and NB models, and significantly different for our OLS model. The statistical significance is highly variable. The OLS model estimates that MPA sites have 5.36 more lobsters on average, but this effect is not statistically significant indicated by the p value being larger than 0.05. The OLS model likely has more error than the other models because of the distribution of our lobster data being overdispersed and highly variable. OLS assumes that residuals are normally distributed, which is not the case for this data. The Poisson model shows a highly significant effect (coefficient = 0.21, p < 0.001), translating to about 23% more lobsters in MPAs. The NB model produces an identical coefficient estimate to the Poisson model but with a much larger standard error (0.17 vs 0.03). The loss of significance in the NB model is because NB accounts for the overdispersion in the data, which the Poisson model ignores.

export_summs(m1_ols, m2_pois, m3_nb,  # Your three models
             model.names = c("OLS", "Poisson", "NB"),
             statistics = "none")
OLSPoissonNB
(Intercept)22.73 ***3.12 ***3.12 ***
(3.57)   (0.02)   (0.12)   
mpaMPA5.36    0.21 ***0.21    
(5.20)   (0.03)   (0.17)   
*** p < 0.001; ** p < 0.01; * p < 0.05.

Step 7: Building intuition - fixed effects

a. Create new df with the year variable converted to a factor

b. Run the following negative binomial model using glm.nb()

  • Add fixed effects for year (i.e., dummy coefficients)
  • Include an interaction term between variables treat & year (treat*year)

c. Take a look at the regression output. Each coefficient provides a comparison or the difference in means for a specific sub-group in the data. Informally, describe the what the model has estimated at a conceptual level (NOTE: you do not have to interpret coefficients individually)

This model shows us how many lobsters were at control sites in 2012, and how that changed each year. It also shows how the effect of our treatment has changed over the years. Basically, it’s showing us if the ‘MPA effect’ is consistent over time. Our results shows that the MPA has generally become more effective as time has gone on, with some decrease in efficacy in 2016 and 2017.

d. Explain why the main effect for treatment is negative? *Does this result make sense?

This means that in 2012 MPA sites had less lobsters than non-MPA. This makes sense because the MPAs were established in 2012, and they were established at areas that had higher fishing pressure.

ff_counts <- spiny_counts %>% 
    mutate(year=as_factor(year))
    
m5_fixedeffs <- glm.nb(
    counts ~ 
        treat +
        year +
        treat*year,
    data = ff_counts)

summ(m5_fixedeffs, model.fit = FALSE)
Observations 252
Dependent variable counts
Type Generalized linear model
Family Negative Binomial(0.8129)
Link log
Est. S.E. z val. p
(Intercept) 2.35 0.26 8.89 0.00
treat -1.72 0.42 -4.12 0.00
year2013 -0.35 0.38 -0.93 0.35
year2014 0.08 0.37 0.21 0.84
year2015 0.86 0.37 2.32 0.02
year2016 0.90 0.37 2.43 0.01
year2017 1.56 0.37 4.25 0.00
year2018 1.04 0.37 2.81 0.00
treat:year2013 1.52 0.57 2.66 0.01
treat:year2014 2.14 0.56 3.80 0.00
treat:year2015 2.12 0.56 3.79 0.00
treat:year2016 1.40 0.56 2.50 0.01
treat:year2017 1.55 0.56 2.77 0.01
treat:year2018 2.62 0.56 4.69 0.00
Standard errors: MLE

e. Look at the model predictions: Use the interact_plot() function from package interactions to plot mean predictions by year and treatment status.

f. Re-evaluate your responses (c) and (b) above. This figure shows us that in 2014, 2015 and 2018 MPA sites had higher counts than non-mpa sites. This is similar to what we observed from the table. We also see the low count of lobsters in MPA sites compared to non-MPA in 2012, that makes sense for the explanation above

interact_plot(m5_fixedeffs, pred = year, modx = treat,
              outcome.scale = "link") # NOTE: y-axis on log-scale

# HINT: Change `outcome.scale` to "response" to convert y-axis scale to counts

g. Using ggplot() create a plot in same style as the previous interaction plot, but displaying the original scale of the outcome variable (lobster counts). This type of plot is commonly used to show how the treatment effect changes across discrete time points (i.e., panel data).

The plot should have… - year on the x-axis - counts on the y-axis - mpa as the grouping variable

# Hint 1: Group counts by `year` and `mpa` and calculate the `mean_count`
# Hint 2: Convert variable `year` to a factor

plot_counts <-  ggplot(spiny_counts, aes(x = year, y = counts, color = mpa, group = mpa)) +
  stat_summary(fun = mean, geom = "point", size = 3) +  # Mean points
  stat_summary(fun = mean, geom = "line", linewidth = 1) +  # Connect means with lines
  labs(
    x = "Year",
    y = "Counts",
    color = "MPA Status",
    title = "Lobster Counts Over Time by MPA Status"
  )

plot_counts

# plot_counts %>% ggplot() ...

Step 8: Reconsider causal identification assumptions

  1. Discuss whether you think spillover effects are likely in this research context (see Glossary of terms; https://docs.google.com/document/d/1RIudsVcYhWGpqC-Uftk9UTz3PIq6stVyEpT44EPNgpE/edit?usp=sharing)

Spillover effects are likely in this research. Natural processes like storms, currents and shifting temperatures can cause individual lobsters or groups of lobsters to migrate between test sites. This means that lobsters counted at one site could be counted at another site within the same year. This also means that lobsters than benefited (survived) from living in an MPA site could find themselves counted in a non-MPA site.

  1. Explain why spillover is an issue for the identification of causal effects

Spillover is in violation of the assumption that control groups are not under the influence of our treatment. If individuals from our treatment group are wandering into the bin where all of the treated individuals are being counted, our data will be inaccurate.

  1. How does spillover relate to impact in this research setting?

As explained above, lobsters that lived their entire lives in an MPA (and thus potentially had a higher chance of survival) wander into a non-MPA site right before they’re counted, our data assumes they spent their entire lives in a non-MPA site, which is not accurate. This could work the other direction as well.

  1. Discuss the following causal inference assumptions in the context of the MPA treatment effect estimator. Evaluate if each of the assumption are reasonable:

    1. SUTVA: Stable Unit Treatment Value assumption

    SUTVA assumes that there is no spillover, and that our non-MPA groups are not experiencing any of the effects of the MPA zone. This is not reasonable for this situation, because the existence of MPA sites means that more fishermen will have to move to non-MPA sites. Meaning lobsters that may not have been caught in non-MPA sites may now have an increased chance of being caught. Since all of the lobsters are living in the same ocean together, we cannot make this assumption. More controls need to be established, which might not be possible for this specific experiment. SUTVA also assumes that all units in the experiment are recieving the same treatment, and that there is only one version of the treatment. This assumption is not reasonable, because every MPA is going to unique. Each MPA will have slightly different geography/habitat/neighboring fishing pressure etc..

    1. Excludability Assumption

    Excludability assumption requires that the treatment is only affecting the outcome through the intended causal pathway. In this context, that would mean that lobster increases are only due to the absence of fisherman encouraged by the MPA rules. This assumption is more reasonable, but we cannot assume that the establishment of MPAs didn’t also change the habits of fisherman - who now likely fish the edge of the MPA. This habitual change is not the intended causal pathway, but would very likely have an effect on lobster populations.

    2.1) Exogeneity assumption

    Exogeneity assumption assumes that the treatment assignment is independent of other factors affecting the outcome. This assumption is not reasonable for this experiment, because there are countless things that could also be causing the effect that we see. Theoretically (but ridiculously) the price of lobster, presence of pirates, or the regional release of a Disney animated movie about lobsters could all produce effects comparable to what we observed. Without accounting for more of these things, we cannot assume that our observed effect on lobster count is solely due to the presence of MPAs, but if similar experiments with varying controls give similar results for many years, this study and others can serve as evidence to the efficacy of MPAs.


EXTRA CREDIT

Use the recent lobster abundance data with observations collected up until 2024 (extracredit_sblobstrs24.csv) to run an analysis evaluating the effect of MPA status on lobster counts using the same focal variables.

  1. Create a new script for the analysis on the updated data
  2. Run at least 3 regression models & assess model diagnostics
  3. Compare and contrast results with the analysis from the 2012-2018 data sample (~ 2 paragraphs)